Like most people these days, I am stuck at home because of the coronavirus, and I feel nostalgic for the time when I could freely run in nature and public parks. To remember those good old days, I decided to create a visualization of the runs I used to do in the “Bois de Vincennes” park when I lived in Paris. The idea is to see where I used to run the most: a heatmap will be perfect for that.

For this little project, I am going to use R and the indispensable ggplot2 library, combined with ggmap. I will also try the leaflet for R library, which I have never used before.

library(tidyverse)
library(xml2)
library(tools)
library(ggmap)
library(leaflet)
library(sp)
library(rgdal)
library(KernSmooth)

1 Retrieving the data

I am retrieving my data from the Strava website: all of my runs are recorded on my Strava account. This is pretty handy because Strava lets you bulk download all of your files: have a look at this help page.

The zip file that you download contains a lot of files. Only some of them will interest us: the .gpx files, which store one GPS track per activity as XML. Here is an excerpt of one of them:

<trkseg>
  <trkpt lat="48.8496940" lon="2.4355100">
    <ele>61.2</ele>
    <time>2016-02-05T17:36:12Z</time>
  </trkpt>
  <trkpt lat="48.8498080" lon="2.4354640">
  ...

As you may have guessed, the next step will be to convert these files into a single tidy dataframe. This is the purpose of the next section.

2 Cleaning the data

The first thing we want to do before parsing all of the files is to keep only the files that contain run coordinates. The file activities.csv is perfect for that; let’s load and explore it.

df_activities <- read_csv("../data/activities.csv")
str(df_activities)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 106 obs. of  10 variables:
##  $ id          : num  4.60e+08 4.67e+08 4.69e+08 4.71e+08 4.72e+08 ...
##  $ date        : POSIXct, format: "2016-01-01 10:26:36" "2016-01-10 08:40:06" ...
##  $ name        : chr  "Course à pied du midi" "Course à pied matinale" "Course à pied en soirée" "Course à pied matinale" ...
##  $ type        : chr  "Run" "Run" "Run" "Run" ...
##  $ description : logi  NA NA NA NA NA NA ...
##  $ elapsed_time: num  2736 3835 1594 4567 2473 ...
##  $ distance    : num  8343 7174 5213 12968 6799 ...
##  $ commute     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ gear        : logi  NA NA NA NA NA NA ...
##  $ filename    : chr  "activities/460488679.gpx" "activities/467047992.gpx" "activities/468881650.gpx" "activities/471346941.gpx" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   date = col_datetime(format = ""),
##   ..   name = col_character(),
##   ..   type = col_character(),
##   ..   description = col_logical(),
##   ..   elapsed_time = col_double(),
##   ..   distance = col_double(),
##   ..   commute = col_logical(),
##   ..   gear = col_logical(),
##   ..   filename = col_character()
##   .. )

What a convenient file! All of the activities are logged, and we immediately know the type of each one thanks to the “type” column. Even better, the relative path to the .gpx file is given for each activity.

Let’s filter this data to keep only the run activities. As most of the columns are irrelevant for our analysis, we will remove them as well.

df_activities <- df_activities %>%
  filter(type == "Run") %>%
  select(id, date, type, elapsed_time, distance, filename)

head(df_activities)
## # A tibble: 6 x 6
##         id date                type  elapsed_time distance filename        
##      <dbl> <dttm>              <chr>        <dbl>    <dbl> <chr>           
## 1   4.60e8 2016-01-01 10:26:36 Run           2736    8343. activities/4604~
## 2   4.67e8 2016-01-10 08:40:06 Run           3835    7174. activities/4670~
## 3   4.69e8 2016-01-12 17:42:57 Run           1594    5213. activities/4688~
## 4   4.71e8 2016-01-16 09:39:53 Run           4567   12968. activities/4713~
## 5   4.72e8 2016-01-17 13:23:44 Run           2473    6799. activities/4723~
## 6   4.74e8 2016-01-19 17:32:13 Run           3343   11238. activities/4738~

That’s better. Now it is time to have a look at the .gpx files. As we saw, they are XML files, so they will be easy to parse with the xml2 library and the XPath query language.
Let’s try that first with a single file.

gpx_file <- "../data/activities/486549502.gpx"
test_file <- read_html(gpx_file)
trkpt_nodes <- xml_find_all(test_file, "//trkpt") # use xpath to parse all trkpt nodes

# collect GPS coordinates (lon / lat) within these nodes
lon <- as.numeric(xml_attr(trkpt_nodes, attr = "lon"))
lat <- as.numeric(xml_attr(trkpt_nodes, attr = "lat"))

# create a dataframe with the run id and the GPS coordinates
id <- gpx_file %>%
  strsplit("/") %>%
  unlist() %>%
  tail(1) %>%
  file_path_sans_ext() %>%
  as.numeric()

df_coordinates <- data.frame("id" = id, lon, lat)
head(df_coordinates)
##          id      lon      lat
## 1 486549502 2.435510 48.84969
## 2 486549502 2.435464 48.84981
## 3 486549502 2.435476 48.84975
## 4 486549502 2.435366 48.84969
## 5 486549502 2.435194 48.84973
## 6 486549502 2.435159 48.84972

And “voilà”! That was pretty simple thanks to XPath. We could have retrieved other attributes such as elevation or time, but they are of no use for this visualization. The id column will help us merge this table with our “activities” dataframe.
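Should we want them later, elevation and timestamps can be pulled from the same trkpt nodes with XPath. A minimal sketch, using an inline GPX fragment like the excerpt above (the variable names are just illustrative):

```r
library(xml2)

# an inline GPX fragment, similar to the excerpt shown earlier
gpx_string <- '<trkseg>
  <trkpt lat="48.8496940" lon="2.4355100">
    <ele>61.2</ele>
    <time>2016-02-05T17:36:12Z</time>
  </trkpt>
</trkseg>'

x <- read_html(gpx_string)

# elevation is the text of the <ele> child of each trkpt node
ele <- as.numeric(xml_text(xml_find_all(x, "//trkpt/ele")))

# timestamps are stored as ISO 8601 strings in the <time> child
time <- xml_text(xml_find_all(x, "//trkpt/time"))
```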

It is time to replicate what we just did for the other files. The code has already been written; we just need to wrap it in a function and process all the files listed in the “activities” dataframe.

# this function retrieves the gps coordinates of a gpx file
gpx_coordinates_retriever <- function(gpx_file){
  x <- read_html(gpx_file)
  trkpt_nodes <- xml_find_all(x, "//trkpt") # use xpath to parse all trkpt nodes
  
  # collect GPS coordinates (lon / lat) within these nodes
  lon <- as.numeric(xml_attr(trkpt_nodes, attr = "lon"))
  lat <- as.numeric(xml_attr(trkpt_nodes, attr = "lat"))
  
  # create a dataframe with the run id and the GPS coordinates
  id <- gpx_file %>%
    strsplit("/") %>%
    unlist() %>%
    tail(1) %>%
    file_path_sans_ext() %>%
    as.numeric()
  
  df_coordinates <- data.frame("id" = id, lon, lat, stringsAsFactors = FALSE)
  df_coordinates
}

files_path <- paste0("../data/", df_activities$filename) # create a valid path to access gpx files
list_coordinates <- lapply(files_path, gpx_coordinates_retriever) # apply the function to all files
df_total_coordinates <- do.call("bind_rows", list_coordinates) # concatenate all dataframes into a single one

# join our coordinates dataframe with the activities one
df_activities_coordinates <- left_join(df_activities, df_total_coordinates, by = "id")
head(df_activities_coordinates)
## # A tibble: 6 x 8
##       id date                type  elapsed_time distance filename    lon
##    <dbl> <dttm>              <chr>        <dbl>    <dbl> <chr>     <dbl>
## 1 4.60e8 2016-01-01 10:26:36 Run           2736    8343. activit~ -0.646
## 2 4.60e8 2016-01-01 10:26:36 Run           2736    8343. activit~ -0.646
## 3 4.60e8 2016-01-01 10:26:36 Run           2736    8343. activit~ -0.646
## 4 4.60e8 2016-01-01 10:26:36 Run           2736    8343. activit~ -0.646
## 5 4.60e8 2016-01-01 10:26:36 Run           2736    8343. activit~ -0.645
## 6 4.60e8 2016-01-01 10:26:36 Run           2736    8343. activit~ -0.645
## # ... with 1 more variable: lat <dbl>
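As a side note, dplyr’s bind_rows() also accepts a list of dataframes directly, so the do.call() wrapper is optional. A quick sketch on toy data (the values here are made up):

```r
library(dplyr)

# two toy per-activity dataframes, shaped like the retriever's output
dfs <- list(
  data.frame(id = 1, lon = 2.4355, lat = 48.8497),
  data.frame(id = 2, lon = 2.4403, lat = 48.8412)
)

df_all <- bind_rows(dfs) # same result as do.call("bind_rows", dfs)
```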

This looks great, but we’re not done yet: to speed up the creation of the charts, and as I am only interested in my runs in the Bois de Vincennes park, I will filter out the GPS coordinates that are clearly not in this area.

A quick look at OpenStreetMap shows that the Bois de Vincennes park ranges from 48.8125 to 48.8505 in latitude, and from 2.3838 to 2.4856 in longitude.

(Screenshot: the Bois de Vincennes on OpenStreetMap)

lat_min <- 48.8125
lat_max <- 48.8505

lon_min <- 2.3838
lon_max <- 2.4856

df_activities_coordinates <- df_activities_coordinates %>%
  filter(lat >= lat_min & lat <= lat_max ) %>%
  filter(lon >= lon_min & lon <= lon_max )
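The same bounding-box filter can also be written with dplyr’s between(), which reads a little more naturally. A small sketch on made-up points (only the first one falls inside the park’s box):

```r
library(dplyr)

# made-up coordinates: the first point is inside the bounding box, the second is not
pts <- data.frame(lat = c(48.83, 48.90), lon = c(2.44, 2.44))

pts_in_box <- pts %>%
  filter(between(lat, 48.8125, 48.8505),
         between(lon, 2.3838, 2.4856))
```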

3 Visualizing data with ggplot2

Let’s first start by making a simple plot: we are going to plot every coordinate as a dot. We need a background map before plotting, though; that is the role of the ggmap library, which gets along very well with ggplot2.

This library lets you use a lot of different sources for your base map, from Google Maps to OpenStreetMap. However, their usage has been restricted quite a lot lately: OpenStreetMap is no longer supported at the moment (see this GitHub issue), and you now need to register with Google services to use their API, even if it remains “mostly” free for limited usage. I am using Stamen, as it does not require an API key and is based on open-source data.

map <- get_map(location=c(left = lon_min,
                          bottom = lat_min,
                          right = lon_max,
                          top = lat_max),
               zoom = 14, 
               source="stamen",
               maptype = "watercolor")

mapPoints <- ggmap(map)

We now have our base map. All we need to do is plot the dots. This is an easy step: as mentioned, ggmap gets along very well with ggplot2. And as it is pretty straightforward to build a scatterplot with ggplot2, we can quickly generate our first draft.

Note that I set the alpha parameter (the transparency) of the dots to 0.01 (on a scale from 0 to 1) to get a first idea of the most frequented areas.

mapPoints +
  geom_point(data = df_activities_coordinates,
             aes(x = lon, y = lat),
             alpha = 0.01)

This is not too bad for a first draft! I really love ggplot2 for its simplicity and its efficiency: only a few lines of code to get a decent chart.

Now, to move on to the next level, I am going to generate a density chart: the “busiest” areas will be colored red, while the places I did not visit very frequently will be blue. To better see the colors, I will use another Stamen base map: the toner map, which is basically a black-and-white version of the map. I add some titles, fine-tune some visual parameters, and this chart should be good to publish (at least on this blog)!

mapBW <- get_map(location=c(left = lon_min,
                          bottom = lat_min,
                          right = lon_max,
                          top = lat_max),
               zoom = 14, 
               source="stamen",
               maptype = "toner")

mapPointsBW <- ggmap(mapBW)

mapPointsBW +
  stat_density_2d(data = df_activities_coordinates,
                  aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
                  geom = "polygon") +
  scale_alpha(range = c(0.3, 1)) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Runs in Bois de Vincennes",
       subtitle = "Distribution of GPS coordinates",
       caption = "Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.")

We can clearly see on this map that I was particularly active in the north of the park; in particular, there is an “intensive” path that leads to the place where I used to live (outside of the map). On the south-west side of the map, a reddish area is visible: this is a hill I particularly enjoyed climbing to get a panoramic view of the park. But the most obvious thing we can deduce from this map is that I missed quite a big part of the park during my runs. In my defense, it is a pretty big park: three times the size of Central Park!

4 Interactive visualization with Leaflet

Static maps are nice, but interactive maps are even better! Whenever possible, I prefer to create interactive visualizations to let the audience “play” with the data.

I generally rely on Folium, a great Python library based on Leaflet, which is… a JavaScript library for creating interactive maps for the web. I had never tried to do the same with R before, so I decided to give it a try. Unfortunately, there is no true equivalent of Folium in the R ecosystem (yet?). I have tested many libraries, but none of them has all the capabilities found in Folium. Moreover, most of them are not actively maintained.

RStudio, the company behind the famous IDE, has created the leaflet for R library. It is very effective for creating basic interactive maps, but it cannot create heatmaps out of the box. Fortunately, with the help of other packages, it is possible to achieve this. I found a particularly interesting thread on Stack Exchange that shows an elegant solution to this problem. I tweaked it a bit (got rid of data.table, changed the bandwidth) to fit my needs, and here is our beautiful interactive map:

# Make contour lines - the bandwidth choice is based on MASS::bandwidth.nrd()
kde <- bkde2D(df_activities_coordinates %>% select(lon, lat) %>% as.data.frame(),
              bandwidth = c(.0009, .0014), gridsize = c(200, 100))
CL <- contourLines(kde$x1, kde$x2, kde$fhat)

# Extract Contour Lines level
LEVS <- as.factor(sapply(CL, `[[`, "level"))
NLEV <- length(levels(LEVS))

# Convert contour lines to polygons
pgons <- lapply(seq_along(CL), function(i)
    Polygons(list(Polygon(cbind(CL[[i]]$x, CL[[i]]$y))), ID = i))
spgons <- SpatialPolygons(pgons)

# Leaflet map with polygons
leaflet(spgons) %>% addTiles() %>% 
    addPolygons(color = heat.colors(NLEV, NULL)[LEVS])
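To embed such a map in a blog post, the widget can be saved as an HTML file with the htmlwidgets package. A minimal sketch (the file name is arbitrary, and selfcontained = FALSE avoids the pandoc dependency):

```r
library(leaflet)
library(htmlwidgets)

# a bare map centered on the Bois de Vincennes, saved as a standalone page
m <- leaflet() %>%
  addTiles() %>%
  setView(lng = 2.4347, lat = 48.8315, zoom = 14)

saveWidget(m, "vincennes_map.html", selfcontained = FALSE)
```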

Sadly, the main insight is the same: I need to go back to Paris and run a lot more to complete my exploration of the Bois de Vincennes!

5 Conclusion

In conclusion, R provides some nice tools for creating maps. I am particularly fond of the combination of ggplot2 and ggmap, as they let you quickly build ready-to-publish charts with only a few lines of code. I used this a lot when I was working for a transport reseller: a map lets you quickly grasp spatial information.

I am less enthusiastic about building interactive maps with R. Maybe it is because I am used to doing it with Python (and Folium), but I found it a bit more tedious in R, as there is no single package designed for this kind of visualization. Still, leaflet for R remains a good solution for building basic interactive maps.